A Simple Scheme for Mining Approximate Evolutionary Trees From Large Scale Data Sets
Abstract
Determining the evolutionary relationships among a set of DNA or protein sequences is an important problem that becomes particularly difficult for large sets of molecular sequence data. Currently used hierarchical clustering methods are not suited to such large-scale analyses because of their prohibitive algorithmic complexity. Here we discuss a simple approach, routed insertion, that overcomes this constraint. Sequences are added to a growing tree using the information provided by reconstructed ancestral sequences, so that each sequence moves along internal nodes until a suitable branch is reached. Routed insertion enables the approximate reconstruction of a phylogenetic tree from character state data in quasi-linear time. Moreover, it is based on a specific model of evolution and employs local maximum likelihood and minimum evolution criteria. The method is applied to infer the evolutionary history of 1968 human and chimpanzee mtDNA HVRI/II sequences.

Introduction

Statistical methods for reconstructing the evolutionary history of a set of DNA sequences […] growing amount of available data it has become evident that the computational complexity of most currently used procedures renders them inappropriate for large data sets. For example, the simple strategy of enumerating all possible tree topologies in an exhaustive tree search takes exponential time in the number of sequences and is therefore impractical for larger numbers of sequences. Alternative strategies, e.g., divide-and-conquer procedures like quartet puzzling (Strimmer and von Haeseler, 1996) or DCM (Huson et al., 1998), have been proposed as heuristics to shortcut the time-consuming search for an optimal maximum-likelihood or parsimony tree. However, the computational complexity of these algorithms still scales with a comparatively high polynomial order (e.g., quartet puzzling requires time proportional to $n^4$, where $n$ is the number of sequences), which likewise rules out very large data sets.

Algorithms based on agglomerative hierarchical clustering (Arabie et al., 1996; Day and Edelsbrunner, 1984) that work on pairwise distances instead of character state data, among them the neighbor-joining method (Saitou and Nei, 1987), offer a faster though usually less accurate alternative. Neighbor-joining demands a computational effort proportional to the cube of the number of sequences analysed (Studier and Keppler, 1988). Though this scaling behaviour is sufficient for many applications, it still does not meet the requirements of large-scale data analysis. Hein has described a procedure (Hein, 1989a; Hein, 1989b) that selects suitable parts of a distance matrix to allow the construction of large trees. For the more …
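To make the scaling argument above concrete, the following Python sketch compares the growth rates mentioned in the text. It is only an illustration, not part of the original paper: the function names are hypothetical, and the count of unrooted bifurcating topologies on n labelled sequences, (2n-5)!!, is the standard combinatorial result assumed here for the exhaustive search. The polynomial costs are bare operation counts with constant factors ignored.

```python
from math import log10, prod


def unrooted_topologies(n: int) -> int:
    """Exact number of unrooted bifurcating trees on n labelled leaves.

    Uses the standard double-factorial formula (2n-5)!! = 3 * 5 * ... * (2n-5).
    For n < 3 the empty product gives 1.
    """
    return prod(range(3, 2 * n - 4, 2))


def log10_unrooted_topologies(n: int) -> float:
    """log10 of the topology count, computed without building the huge integer."""
    return sum(log10(k) for k in range(3, 2 * n - 4, 2))


def polynomial_cost(n: int, degree: int) -> int:
    """Idealised operation count n**degree (constant factors ignored)."""
    return n ** degree


if __name__ == "__main__":
    # 1968 is the number of mtDNA sequences analysed in the abstract.
    for n in (10, 100, 1968):
        print(
            f"n={n}: "
            f"n^3 (neighbor-joining) = {polynomial_cost(n, 3):.3e}, "
            f"n^4 (quartet puzzling) = {polynomial_cost(n, 4):.3e}, "
            f"topologies ~ 10^{log10_unrooted_topologies(n):.0f}"
        )
```

The logarithmic variant is used for large n because the exact count quickly reaches thousands of digits. That makes exhaustive enumeration hopeless long before the cubic and quartic costs become the bottleneck.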
Publication date: 1999